This is problem set #2, in which we hope you will practice the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr.

Part 1: Basic intro to ggplot

Part 1A: Exploring ggplot2 using qplot

Note, that this example is from the_grammar.R on http://had.co.nz/ggplot2 I’ve adapted this for psych 254 purposes

First install and load the package.

#install.packages("ggplot2")
library(ggplot2)
library(reshape2)
library(bootstrap)
library(lme4)
## Loading required package: Matrix
## Loading required package: Rcpp
library(stringr)
library(lubridate)
library(plyr) # first load plyr then dplyr from https://github.com/hadley/dplyr
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:lubridate':
## 
##     here
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## 
## The following object is masked from 'package:Matrix':
## 
##     expand

Now we’re going to use qplot. qplot is the easy interface, meant to replace plot. You can give it simple qplot(x,y) examples, or slightly more complex examples like qplot(x, y, col=grp, data=d).

We’re going to be using the diamonds dataset. This is a set of measurements of diamonds, along with their price etc.

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
qplot(diamonds$carat, diamonds$price)

Scatter plots are trivial, and easy to add features to. Modify this plot so that it uses the dataframe rather than working from variables in the general namespace (good to get away from retyping diamonds$ every time you reference a variable).

d<-diamonds
qplot(carat, price, data = d)

Try adding clarity and cut, using shape and color as your visual variables.

qplot(clarity, price, data=d)

qplot(cut, price, data=d)

#qplot(color, price)

qplot(carat, price, shape = cut, color = clarity, data = d)

qplot(carat, price, shape = clarity, color = cut, data = d)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 8.
## Consider specifying shapes manually. if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 8.
## Consider specifying shapes manually. if you must have them.

One of the primary benefits of ggplot2 is the use of facets - also known as small multiples in the Tufte vocabulary. That last plot was probably hard to read. Facets could make it better. Try adding a facets = x ~ y argument. x ~ y means row facets are by x, column facets by y.

Can’t get ggplot facet to work

qplot(carat, price, facets = cut ~ clarity, data = d)

qplot(carat, price, facets = clarity ~ cut, data=d)

But facets can also get overwhelming. Try to strike a good balance between color, shape, and faceting.

HINT: facets = . ~ x puts x on the columns, but facets = ~ x (no dot) wraps the facets. These are underlying calls to different functions, facet_wrap (no dot) and facet_grid (two arguments).

qplot(carat, price, color = clarity, facets = ~ cut, data = d)

qplot(carat, price, color = cut, facets = ~ clarity, data = d)

qplot(carat, price, color = cut, facets = . ~ clarity, data = d)

qplot(carat, price, color = clarity, facets = . ~ cut, data = d)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(data=d, shape=1) + facet_grid(~cut)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(data=d, shape=1) + facet_grid(color~cut)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(data=d, shape=1) + facet_grid(color~cut)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(data=d, shape=1) + facet_wrap(~cut)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(data=d, shape=1) + facet_wrap(color~cut)

The basic unit of a ggplot plot is a “geom” - a mapping between data (via an “aesthetic”) and a particular geometric configuration on coordinate axes.

Let’s try some other geoms and manipulate their parameters. First, try a histogram (geom="hist").

qplot(price, geom = 'histogram', data = d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(carat, geom = 'histogram', data = d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(cut, price, geom = 'boxplot', data = d)

qplot(clarity, price, geom = 'boxplot', data = d)

qplot(clarity, price, geom = 'boxplot', data = d)

ggplot(diamonds, aes(x=carat)) + geom_histogram(data=d, binwidth=2,colour="white")

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=2,colour="white")

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink")

Now facet your histogram by clarity and cut.

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink") + facet_wrap(~cut)

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink") + facet_wrap(~clarity)

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink") + facet_wrap(cut~clarity)

I like a slightly cleaner look to my plots. Luckily, ggplot allows you to add “themes” to your plots. Try doing the same plot but adding + theme_bw() or + theme_classic(). Different themes work better for different applications, in my experience.

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink") + facet_wrap(~clarity)+theme_classic()

ggplot(diamonds, aes(x=price)) + geom_histogram(data=d, binwidth=1000,colour="pink") + facet_wrap(~clarity)+theme_bw()

Part 1B: Exploring ggplot2 using ggplot

ggplot is just a way of building qplot calls up more systematically. It’s sometimes easier to use and sometimes a bit more complicated. What I want to show off here is the functionality of being able to build up complex plots with multiple elements. You can actually do this using qplot pretty easily, but there are a few things that are hard to do.

ggplot is the basic call, where you specify A) a dataframe and B) an aesthetic mapping from variables in the plot space to variables in the dataset.

d <- ggplot(diamonds, aes(x=carat, y=price)) # first you set the aesthetic and dataset
d + geom_point() # then you add geoms

d + geom_point(aes(colour = carat)) # and you can keep doing this to add layers to the plot

Try writing this as a single set of additions (e.g. one line of R code, though you can put in linebreaks). This is the most common workflow for me.

d2<-(ggplot(diamonds, aes(x = carat, y = price, colour = carat))
  + geom_point())
d2

(ggplot(diamonds, aes(x = carat, y = price, colour = clarity))
  + geom_point())

You can also set the aesthetic separately for each geom, and make some great plots this way. Though this can get complicated. Try using ggplot to build a histogram of prices.

ggplot(diamonds, aes(x = price)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(diamonds, aes(x = price, colour=clarity)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Part 2: Diving into real data: Sklar et al. (2012)

Sklar et al. (2012) claims evidence for unconscious arithmetic processing. We’re going to do a reanalysis of their Experiment 6, which is the primary piece of evidence for that claim. The data are generously contributed by Asael Sklar.

First let’s set up a few preliminaries.

library(tidyr)
library(dplyr)

sem <- function(x) {sd(x) / sqrt(length(x))}
ci95 <- function(x) {sem(x) * 1.96}

Data Prep

First read in two data files and subject info. A and B refer to different trial order counterbalances.

subinfo <- read.csv("../data/sklar_expt6_subinfo_corrected.csv")
d.a <- read.csv("../data/sklar_expt6a_corrected.csv")
d.b <- read.csv("../data/sklar_expt6b_corrected.csv")

Gather these datasets into long form and get rid of the Xs in the headers.

d.a.long <- gather(d.a, subid, rt, X1:X21)
d.a.long$subid <- sub('X', '', d.a.long$subid)
d.a.long$cond <- 'A'

d.b.long <- gather(d.b, subid, rt, X22:X42)
d.b.long$subid <- sub('X', '', d.b.long$subid)
d.b.long$cond <- 'B'

Bind these together. Check out bind_rows.

d.all.long <- bind_rows(d.a.long, d.b.long)
d.all.long$subid <- as.numeric(d.all.long$subid)
d.all.long$cond <- as.factor(d.all.long$cond)

Merge these with subject info. You will need to look into merge and its relatives, left_join and right_join. Call this dataframe d, by convention.

d <- left_join(d.all.long, subinfo)
## Joining by: "subid"
stopifnot(nrow(d) == nrow(d.all.long))

Clean up the factor structure.

d$presentation.time <- factor(d$presentation.time)
levels(d$operand) <- c("addition","subtraction")

Data Analysis Preliminaries

Examine the basic properties of the dataset. First, take a histogram.

d$rt <- as.numeric(d$rt)
hist(d$rt)

qplot(rt, geom = 'histogram', data = d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(objective.test, geom = 'histogram', data = d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Challenge question: what is the sample rate of the input device they are using to gather RTs?

sort(unique(d$target))
##  [1]  0  1  2  3  4  5  6  8  9 10 11 12 13 14
sort(unique(d$prime.result))
##  [1]  0  1  2  3  4  5  6  8  9 10 11 12 13 14

Sklar et al. did two manipulation checks. Subjective - asking participants whether they saw the primes - and objective - asking them to report the parity of the primes (even or odd) to find out if they could actually read the primes when they tried. Examine both the unconscious and conscious manipulation checks (this information is stored in subinfo). What do you see? Are they related to one another?

hist(subinfo$objective.test)

hist(subinfo$subjective.test)

(ggplot(d, aes(x = subjective.test, y = objective.test))
  + geom_point())

(ggplot(d, aes(x = subjective.test, y = rt, colour = congruent))
  + geom_point())
## Warning: Removed 237 rows containing missing values (geom_point).

(ggplot(d, aes(x = objective.test, y = rt, colour = congruent))
  + geom_point())
## Warning: Removed 237 rows containing missing values (geom_point).

(ggplot(d, aes(x = objective.test, y = subjective.test))
  + geom_point()) + stat_smooth(method="glm", family="binomial", se=F)

OK, let’s turn back to the measure and implement Sklar et al.’s exclusion criterion. You need to have said you couldn’t see (subjective test) and also be not significantly above chance on the objective test (< .6 correct). Call your new data frame ds.

ds <- d[d$objective.test < .6 & d$subjective.test == 0, ]
head(ds)
## Source: local data frame [6 x 13]
## 
##    prime prime.result target congruent  operand distance counterbalance
## 1 =1+2+5            8      9        no addition       -1              1
## 2 =1+3+5            9     11        no addition       -2              1
## 3 =1+4+3            8     12        no addition       -4              1
## 4 =1+6+3           10     12        no addition       -2              1
## 5 =1+9+2           12     11        no addition        1              1
## 6 =1+9+3           13     12        no addition        1              1
## Variables not shown: subid (dbl), rt (dbl), cond (fctr), presentation.time
##   (fctr), subjective.test (int), objective.test (dbl)

Sklar et al.’s analysis

Sklar et al. show a plot of a “facilitation effect” - the time to respond to incongruent primes minus the time to respond to congruent primes. They then show plot this difference score for the subtraction condition and for the two presentation times they tested. Try to reproduce this analysis.

HINT: first take averages within subjects, then compute your error bars across participants, using the sem function (defined above).

# subject data
d.subj <- aggregate(rt ~ subid + operand + presentation.time + congruent, ds, FUN = 'mean')
d.subj <- spread(d.subj, congruent, rt)
d.subj <- mutate(d.subj, facilitation = no - yes)

# get sem
d.summary <- aggregate(facilitation ~ operand + presentation.time, d.subj, FUN = mean)
names(d.summary)[3] <- 'mean_facilitation' 
d.sem <- aggregate(facilitation ~ operand + presentation.time, d.subj, FUN = sem)
d.summary$sem_facilitation = d.sem$facilitation

Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here. Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).

ggplot(d.summary, aes(x = presentation.time, y = mean_facilitation)) + geom_bar(stat = 'summary', fun.y = 'mean') + facet_grid(.~operand) + geom_errorbar(aes(ymin = mean_facilitation - sem_facilitation, ymax = mean_facilitation + sem_facilitation), width = .5)
## Warning: Stacking not well defined when ymin != 0

What do you see here? How close is it to what Sklar et al. report? Do the error bars match? How do you interpret these data? #Here we see that there is facilitation for subtraction but not for addition when presented with a prime for 1700 ms. In other words, people are faster at saying a number when they are subconsciously primed with the same number (via subtraction problem) before the trial. However, they are slowed when this prime is an addtion problem.

Challenge problem: verify Sklar et al.’s claim about the relationship between RT and the objective manipulation check. #Here I ran the same analysis using the trials that objective teset results are greater than .6. The error bars are huge. There might be a slight faciliation effect but no differences betwen subtraction and addition trials.

ds2 <- d[d$objective.test > .6 & d$subjective.test == 0, ]
d.subj2 <- aggregate(rt ~ subid + operand + presentation.time + congruent, ds2, FUN = 'mean')
d.subj2 <- spread(d.subj2, congruent, rt)
d.subj2 <- mutate(d.subj2, facilitation = no - yes)
d.summary2 <- aggregate(facilitation ~ operand + presentation.time, d.subj2, FUN = mean)
names(d.summary2)[3] <- 'mean_facilitation' 
d.sem2 <- aggregate(facilitation ~ operand + presentation.time, d.subj2, FUN = sem)
d.summary2$sem_facilitation2 = d.sem2$facilitation
ggplot(d.summary2, aes(x = presentation.time, y = mean_facilitation)) + geom_bar(stat = 'summary', fun.y = 'mean') + facet_grid(.~operand) + geom_errorbar(aes(ymin = mean_facilitation - sem_facilitation2, ymax = mean_facilitation + sem_facilitation2), width = .5)

Your own analysis

Show us what you would do with these data, operating from first principles. What’s the fairest plot showing a test of Sklar et al.’s original hypothesis that people can do arithmetic “non-consciously”?

(ggplot(d, aes(x = distance, y = rt, colour = congruent))
  + geom_point())
## Warning: Removed 237 rows containing missing values (geom_point).

q
## function (save = "default", status = 0, runLast = TRUE) 
## .Internal(quit(save, status, runLast))
## <bytecode: 0x10feb2440>
## <environment: namespace:base>

Challenge problem: Do you find any statistical support for Sklar et al.’s findings?